Abstract: This paper investigates the limit behavior of Markov decision processes (MDPs) made 
of independent particles evolving in a common environment, when the number of particles goes to 
infinity. 

In the finite horizon case or with a discounted cost and an infinite horizon, we show that when 
the number of particles becomes large, the optimal cost of the system converges almost surely to 
the optimal cost of a deterministic system (the "optimal mean field"). Convergence also holds for 
optimal policies. 

We further provide insights on the speed of convergence by proving several central limit 
theorems for the cost and the state of the Markov decision process, with explicit formulas for the 
variance of the limit Gaussian laws. 

Then, our framework is applied to a brokering problem in grid computing. The optimal policy 
for the limit deterministic system is computed explicitly. Several simulations with growing numbers 
of processors are reported. They compare the performance of the optimal policy of the limit system 
used in the finite case with classical policies (such as Join the Shortest Queue) by measuring its 
asymptotic gain. 

Key-words: Markov Decision Processes, Mean Field, Optimization, Particles System, Grid 
Broker 



Centre de recherche INRIA Grenoble - Rhône-Alpes 
655, avenue de l'Europe, 38334 Montbonnot Saint Ismier 

Telephone: +33 4 76 61 52 00, Fax: +33 4 76 61 52 52 



A mean field approach for optimization in particle systems and its applications 

Summary: This article examines the limit behavior of Markov decision processes made of 
independent particles evolving in a common environment, when the number of particles goes to 
infinity. 

When the cost is over a finite horizon, or over an infinite horizon with discounting, we show 
that as the number of particles becomes large, the optimal cost of the system converges almost 
surely to the optimal cost of the deterministic system. Convergence also holds for the optimal 
policies. 

Moreover, we give an insight into the speed of convergence by proving several central limit 
theorems for the cost and for the mean state of the process, with explicit formulas for the 
variance of the limit Gaussian laws. 

Finally, this model is applied to a resource-broker problem in computational grids. We give an 
explicit algorithm to compute the optimal policy of the limit, and then study several simulations 
with a varying number of processors. We compare the performance of the optimal policy of the 
limit, applied to the initial system, with several classical policies (such as Join the Shortest 
Queue). We measure the asymptotic gain, as well as the threshold beyond which it outperforms the 
classical policies. 

Keywords: Markov decision processes, mean field, optimization, particle systems, resource 
broker 
1 Introduction 

The general context of this paper is the optimization of the behavior of controlled Markovian 
systems, namely Markov Decision Processes composed of a large number of particles evolving in a 
common environment. 

Consider a discrete time system made of N particles, N being large, that evolve randomly and 
independently (according to a transition probability kernel K). At each step, the state of each 
particle changes according to a probability kernel, depending on the environment. The evolution of 
the environment only depends on the number of particles in each state. Furthermore, at each step, 
a central controller makes a decision that changes the transition probability kernel. The problem 
addressed in this paper is to study the limit behavior of such systems when N becomes large and 
the speed of convergence to the limit. 

Several papers ([31, 0) study the limit behavior of Markovian systems in the case of vanishing 
intensity (the expected number of transitions per time slot is o(N)). In these cases, the system 
converges to a differential system in continuous time. In the case considered here, time remains 
discrete at the limit. This requires a rather different approach to construct the limit. 

In [§], discrete time systems are considered and the authors show that under certain conditions, 
as N grows large, a Markovian system made of N particles converges to a deterministic system. 
Since a Markov decision process can be seen as a family of Markovian kernels, the class of systems 
studied in [§] corresponds to the case where this family is reduced to a unique kernel and no decision 
can be made. Here, we show that under similar conditions as in Q , a Markov decision process also 
converges to a deterministic one. More precisely, we show that the optimal costs (as well as the 
corresponding states) converge almost surely to the optimal costs (resp. the corresponding states) 
of a deterministic system (the "optimal mean field"). 

From a practical point of view, this allows one to compute the optimal policy in a deterministic 
system, which can often be done very efficiently, and then to use this policy in the original 
random system as a good approximation of the optimal policy, which cannot be computed efficiently 
because of the curse of dimensionality. This is illustrated by an application of our framework to 
optimal brokering in computational grids. We consider a set of multi-processor clusters (forming a 
computational grid, like EGEE [1]) and a set of users submitting tasks to be executed. A central 
broker assigns the tasks to the clusters (where tasks are buffered and served in FIFO order) and 
tries to minimize the average processing time of all tasks. Computing the optimal policy (solving 
the associated MDP) is known to be hard [13]. Numerical computations can only be carried out up to 
a total of 10 processors and two users. However, our approach shows that when the number of 
processors per cluster and the number of users submitting tasks grow, the system converges to a 
mean field deterministic system. For this deterministic mean field system, the optimal brokering 
policy can be explicitly computed. Simulations reported later in the paper show that using this 
policy over a grid with a growing number of processors makes performance converge to the optimal 
sojourn time in a deterministic system, as expected. Simulations also show that this deterministic 
static policy outperforms classical dynamic policies such as Join the Shortest Queue, as soon as 
the total number of processors and users is over 50. 

In general, how good the deterministic approximation is and how fast convergence takes place 
can also be estimated. For that, we provide bounds on the speed of convergence by proving 
central limit theorems for the state of the system under the optimal policy as well as for the 
cost function. 
2 Notations and definitions 

The system is composed of N particles. There are S possible states for each particle; the state 
space is denoted by $\mathcal{S} = \{1, \dots, S\}$. The state of the nth particle at time t is 
denoted $X_n^N(t)$. We assume that the particles are distinguishable only through their state and 
that the dynamics of the system is homogeneous in N. In other words, the behavior of the system 
only depends on the proportion of particles in every state i. For all $i \in \mathcal{S}$, 
$(M_t^N)_i \overset{\text{def}}{=} \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}_{X_n^N(t) = i}$ is the 
proportion of particles in state i, and we denote by $M_t^N$ the vector 
$((M_t^N)_1, \dots, (M_t^N)_S)$. The set of possible values for $M_t^N$ is the set of probability 
measures p on $\{1, \dots, S\}$ such that $N p(i) \in \mathbb{N}$ for all $i \in \mathcal{S}$, 
denoted by $\mathcal{P}_N(\mathcal{S})$. For each N, $\mathcal{P}_N(\mathcal{S})$ is a finite set. 
When N goes to infinity, it converges to $\mathcal{P}(\mathcal{S})$, the set of probability 
measures on $\mathcal{S}$. 

The system of particles evolves depending on their common environment. We call 
$C \in \mathbb{R}^d$ the context of the environment. Its evolution depends on the mean state of 
the particles $M^N$, on its own value at the previous time slot, and on the action $a_t$ chosen 
by the controller (see below): 

$$C_{t+1}^N = g(C_t^N, M_{t+1}^N, a_t),$$

where $g : \mathcal{P}_N(\mathcal{S}) \times \mathbb{R}^d \to \mathbb{R}^d$ is a continuous function. 

2.1 Actions and policies 

At each time t, the system's state is $M \in \mathcal{P}_N(\mathcal{S})$. The decision maker may 
choose an action a from the set of possible actions A. A is assumed to be a compact set (finite 
or infinite). The action determines how the system will evolve. For an action $a \in A$ and an 
environment $C \in \mathbb{R}^d$, we have a transition probability kernel $K(a, C)$ such that the 
probability that a particle goes from state i to state j is $K_{ij}(a, C)$: 

$$P\big(X_n^N(t+1) = j \mid X_n^N(t) = i,\ a_t = a,\ C_t^N = C\big) = K_{ij}(a, C).$$

The evolutions of particles are supposed to be independent once C is given. Moreover, we assume 
that $K_{ij}(a, C)$ is continuous in a and C. The assumption of independence of the users is a rather 
common assumption in mean field models Q . However other papers 0, Q have shown that similar 
results can be obtained using asymptotic independence only (see for results of this type) . 
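
The one-step dynamics described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `step`, the two-state kernel `toy_kernel` and the context update `toy_g` are hypothetical choices made only for the example, and the argument order of `g` follows the environment update $C_{t+1} = g(C_t, M_{t+1}, a_t)$ given above.

```python
import random

def step(counts, c, a, kernel, g, rng):
    """Simulate one time slot of the N-particle system.

    counts[i] is the number of particles in state i, c is the context
    (a list of d floats) and a is the action chosen by the controller.
    kernel(a, c) returns a row-stochastic S x S matrix K(a, C) and
    g(c, m_next, a) returns the next context.
    """
    n = sum(counts)
    num_states = len(counts)
    K = kernel(a, c)
    new_counts = [0] * num_states
    for i, k in enumerate(counts):
        # Given C, the k particles in state i move independently
        # according to row i of the kernel.
        for j in rng.choices(range(num_states), weights=K[i], k=k):
            new_counts[j] += 1
    m_next = [x / n for x in new_counts]  # empirical measure M_{t+1}
    return new_counts, g(c, m_next, a)

# Hypothetical two-state example (S = 2, d = 1), for illustration only.
def toy_kernel(a, c):
    return [[1.0 - a, a], [0.5, 0.5]]

def toy_g(c, m_next, a):
    return [0.5 * c[0] + m_next[1]]

rng = random.Random(0)
counts, c = step([1000, 0], [0.0], 0.25, toy_kernel, toy_g, rng)
```

Working with the vector of counts rather than with individual particle labels reflects the assumption that particles are distinguishable only through their state.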

Here, the focus is on Markov Decision Processes theory and on the computation of optimal 
policies. A policy $\pi = (\pi_1 \dots \pi_t \dots)$ specifies the decision rules to be used at 
each time slot. A decision rule $\pi_t$ is a procedure that provides an action at time t. In 
general, $\pi_t$ is a random measurable function that depends on the events 
$((M_1, C_1) \dots (M_t, C_t))$, but it can be shown that when the state space is finite and the 
action space is compact, deterministic Markovian policies (i.e. policies that only depend 
deterministically on the current state) are dominant; therefore we will only focus on them [14]. 

2.2 Reward functions 

To each possible state (M, C) of the system at time t, we associate a reward $r_t(M, C)$. The 
reward is assumed to be continuous in M and C. This function can be seen either as a reward, in 
which case the controller wants to maximize it, or as a cost, in which case the goal of the 
controller is to minimize it. In this paper, we will focus on two problems: finite-horizon reward 
and discounted reward. 
In the finite-horizon case, we want to maximize the sum of the rewards over all time t < T plus 
a final reward that depends on the final state, $r_T(M_T^N, C_T^N)$. The expected reward of the 
policies $\pi_0, \dots, \pi_{T-1}$ is: 

$$V^N_{\pi_0 \dots \pi_{T-1}}(M_0^N, C_0^N) \overset{\text{def}}{=} E\Big[\sum_{t=1}^{T-1} r_t(M_t^N, C_t^N) + r_T(M_T^N, C_T^N)\Big],$$

where the expectation is taken over all possible $(M_t^N, C_t^N)$ when the action at time t is 
$\pi_t(M_t^N, C_t^N)$, for all t. 

Let $0 < \delta < 1$. The discounted reward associated with $\delta$ and the policy 
$\pi_0 \dots \pi_t \dots$ is the quantity: 

$$V^N_{(\delta), \pi_0 \dots}(M_0^N, C_0^N) \overset{\text{def}}{=} E\Big[\sum_{t=1}^{\infty} \delta^t r_t(M_t^N, C_t^N)\Big].$$

Again, the expectation is taken over all possible $(M_t^N, C_t^N)$ when the action at time t is 
$\pi_t(M_t^N, C_t^N)$, for all t. 

In both cases, the goal of the controller is to find a policy that maximizes the expected reward: 

$$V_T^{*N}(M_0^N, C_0^N) \overset{\text{def}}{=} \sup_{\pi_0 \dots \pi_{T-1}} V^N_{\pi_0 \dots \pi_{T-1}}(M_0^N, C_0^N),$$

$$V_{(\delta)}^{*N}(M_0^N, C_0^N) \overset{\text{def}}{=} \sup_{\pi_0 \dots} V^N_{(\delta), \pi_0 \dots}(M_0^N, C_0^N).$$

2.3 Summary of the assumptions 

Here is the list of the assumptions under which all our results will hold, together with some 
comments on their tightness and their degree of generality and applicability. 

(A1) Independence of the users, Markov system - If at time t the environment is C and the 
action is a, then the behavior of each particle is independent of other particles and its 
evolution is Markovian with a kernel $K(a, C)$. 

(A2) Compact action set - The set of actions A is compact. 

(A3) Continuity of K, g, r - The mappings $(C, a) \mapsto K(a, C)$, $(C, M, a) \mapsto g(C, M, a)$ 
and $(M, C) \mapsto r_t(M, C)$ are continuous deterministic functions, uniformly continuous in a. 

(A4) Almost sure initial state - Almost surely, the initial state $(M_0^N, C_0^N)$ converges to a 
deterministic value $(m_0, c_0)$. Moreover, there exists $B < \infty$ such that almost surely 
$\|C_0^N\|_\infty \le B$, where $\|C\|_\infty = \sup_i |C_i|$. 

To simplify the notations, we choose the functions K and g not to depend on time. However, as 
the proofs will be done for each time step, they also hold if the functions are time-dependent 
(in the finite horizon case). 

Also, K, g and r do not depend on N, although they do in most practical cases. Adding a 
uniform continuity assumption on these functions for all N will make all the proofs work the same. 

Here are some comments on the uniform bound B on the initial condition (A4). In fact, as $C_0^N$ 
converges almost surely, $C_0^N$ is almost surely bounded. Here we add a bound B which is uniform 
over all events in order to be sure that the variable is dominated by an integrable function. As g 
is continuous and the sets A and $\mathcal{P}(\mathcal{S})$ are compact, this shows that for all 
t, there exists $B_t < \infty$ such that 

$$\|C_t^N\|_\infty \le B_t. \qquad (1)$$

Finally, in many cases the rewards also depend on the action. This is not the case here, at a 
small loss of generality. 

3 Convergence results and optimal policy 

In the case where there is no control, one can adapt the results proved in ^ to show that when N 
goes to infinity, the system converges almost surely to a deterministic one. In our case, this means 
that if the actions are fixed, the system converges. 

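For any fixed action a and any value $M \in \mathcal{P}_N(\mathcal{S})$, we define the random 
variable $\Phi_a^N(M, C)$ that corresponds to the state $(M', C')$ of the system after one 
iteration started from $(M, C)$. For $m \in \mathcal{P}(\mathcal{S})$, we define $\Phi_a(m, c)$, 
the (deterministic) value corresponding to one iteration of the mean field system: 
$\Phi_a(m_t, c_t) = (m_{t+1}, c_{t+1})$ where 

$$m_{t+1} = m_t \, K(a, c_t),$$
$$c_{t+1} = g(c_t, m_{t+1}, a).$$

We call $\Phi^N_{a_0 \dots a_{T-1}}$ (resp. $\Phi_{a_0 \dots a_{T-1}}$) the composition of 
$\Phi^N_{a_0}, \dots, \Phi^N_{a_{T-1}}$ (resp. of $\Phi_{a_0}, \dots, \Phi_{a_{T-1}}$), that is, 
the successive application of the actions $a_0, \dots, a_{T-1}$. 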
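
As an illustration, the deterministic iteration can be written directly from its definition. The sketch below is only an example under stated assumptions: the names `mean_field_step`, `mean_field_trajectory`, `toy_kernel` and `toy_g` are hypothetical, and `g` is called with the argument order $(c_t, m_{t+1}, a)$ of the environment update in Section 2.

```python
def mean_field_step(m, c, a, kernel, g):
    """Apply one iteration of the deterministic limit: (m, c) -> Phi_a(m, c)."""
    K = kernel(a, c)
    S = len(m)
    # m_{t+1} = m_t K(a, c_t): row vector times row-stochastic matrix.
    m_next = [sum(m[i] * K[i][j] for i in range(S)) for j in range(S)]
    return m_next, g(c, m_next, a)

def mean_field_trajectory(m0, c0, actions, kernel, g):
    """Return [(m_0, c_0), ..., (m_T, c_T)] under the actions a_0 ... a_{T-1}."""
    path = [(list(m0), list(c0))]
    for a in actions:
        path.append(mean_field_step(*path[-1], a, kernel, g))
    return path

# Hypothetical two-state example (S = 2, d = 1), for illustration only.
def toy_kernel(a, c):
    return [[1.0 - a, a], [0.5, 0.5]]

def toy_g(c, m_next, a):
    return [0.5 * c[0] + m_next[1]]

path = mean_field_trajectory([1.0, 0.0], [0.0], [0.25, 0.25], toy_kernel, toy_g)
```

Unlike the N-particle system, this trajectory involves no randomness, so it can be computed once and reused when comparing action sequences.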

In Q, the system is homogeneous in time. However, the proofs are done for each step time and 
the results still hold without time homogeneity. With our notations, theorem 4.1 of [1] says that if 
the actions are $a_0 \dots a_{T-1}$, and if the initial state converges almost surely, then the 
system of size N converges almost surely. 

Theorem 1 (Mean Field Limit, th. 4.1 of [1]). Under assumptions (A1, A3, A4), if the controller 
takes the action $a_t$ at time t, then for any fixed T: 

$$(M_T^N, C_T^N) \xrightarrow{a.s.} \Phi_{a_0 \dots a_{T-1}}(m_0, c_0).$$

In the following, we will first show that if we fix the actions, the total reward of the system 
converges when N grows, then we will show that the optimal reward also converges. 

3.1 Finite horizon model 

In this section, the horizon T is fixed; the infinite horizon case will be treated in Section 
3.3. Using the same notation and hypotheses as in Theorem 1, we define the reward of the 
deterministic system starting at $(m_0, c_0)$ under the actions $a_0, \dots, a_{T-1}$: 

$$v_{a_0 \dots a_{T-1}}(m_0, c_0) \overset{\text{def}}{=} \sum_{t=1}^{T} r_t\big(\Phi_{a_0 \dots a_{t-1}}(m_0, c_0)\big).$$

For any t, if the action taken at instant t is fixed equal to $a_t$, then $(M_t^N, C_t^N)$ 
converges almost surely to $(m_t, c_t)$. Since the reward at time t is continuous, this means 
that the finite-horizon expected reward converges as N grows large: 
Lemma 2 (Convergence of the reward). Under assumptions (A1, A3, A4), if the controller takes 
actions $a_0 \dots a_{T-1}$, the finite-horizon expected reward of the stochastic system 
converges to the finite-horizon reward of the deterministic system: 

$$\lim_{N \to \infty} V^N_{a_0 \dots a_{T-1}}(M_0^N, C_0^N) = v_{a_0 \dots a_{T-1}}(m_0, c_0) \quad \text{a.s.}$$

Proof. For all t, $(M_t^N, C_t^N)$ converges almost surely to $(m_t, c_t)$. Since the reward at 
time t is continuous in (M, C), $r_t(M_t^N, C_t^N) \to r_t(m_t, c_t)$ almost surely. Moreover, as 
(M, C) are bounded (see Equation (1)), the dominated convergence theorem shows that 
$E[r_t(M_t^N, C_t^N)]$ goes to $r_t(m_t, c_t)$, which concludes the proof. □ 

Now, let us consider the problem of convergence of the reward under the optimal strategy 
of the controller. First, it should be clear that the optimal strategy exists for the limit 
system. Indeed, the limit system being deterministic, starting at state $(m_0, c_0)$, one only 
needs to know the actions to take for all $(m_t, c_t)$ to compute the reward. The optimal policy 
is deterministic and 

$$v_T^*(m_0, c_0) \overset{\text{def}}{=} \sup_{a_0 \dots a_{T-1}} \{ v_{a_0 \dots a_{T-1}}(m_0, c_0) \}.$$

Since the action set is compact, this supremum is a maximum: there exist $a_0^* \dots a_{T-1}^*$ 
such that $v_T^*(m_0, c_0) = v_{a_0^* \dots a_{T-1}^*}(m_0, c_0)$. In fact, in many cases there is 
more than one optimal action sequence. In the following, $a_0^* \dots a_{T-1}^*$ is one of them, 
and will be called the sequence of optimal limit actions. 

Theorem 3 (Convergence of the optimal reward). Under assumptions (A1, A2, A3, A4), as N goes 
to infinity, the optimal reward of the stochastic system converges to the optimal reward of the 
deterministic limit system: almost surely, 

$$\lim_{N \to \infty} V_T^{*N}(M_0^N, C_0^N) = \lim_{N \to \infty} V^N_{a_0^* \dots a_{T-1}^*}(M_0^N, C_0^N) = v_T^*(m_0, c_0).$$

In words, this theorem says that, at the limit, the reward of the optimal policy under full 
information, $V_T^{*N}(M_0^N, C_0^N)$, is the same as the reward obtained when the optimal limit 
actions $(a_0^* \dots a_{T-1}^*)$ are used in the original system, both being equal to the 
optimal reward of the limit deterministic system, $v_T^*(m_0, c_0)$. 

Proof. For all N, all $0 \le t \le T$ and all $(M, C) \in \mathcal{P}_N(\mathcal{S}) \times \mathbb{R}^d$, 
let us define by induction on t the function $V^{*N}_{t \dots T}$: 

$$V^{*N}_{T \dots T}(M, C) = r_T(M, C),$$
$$V^{*N}_{t \dots T}(M, C) = r_t(M, C) + \sup_{a \in A} E_{M,C}\big[V^{*N}_{t+1 \dots T}(\Phi_a^N(M, C))\big], \qquad (2)$$

where the expectation $E_{M,C}[\cdot]$ is taken over all possible values of $\Phi_a^N(M, C)$ given 
(M, C). Also notice that $V^{*N}_{t \dots T}(M, C)$ is the maximal expected reward between time t 
and time T starting in (M, C), and therefore $V^{*N}_{0 \dots T} = V_T^{*N}$. 

Let us also define for the limit system, similarly (by removing the expectation): 

$$v^*_{T \dots T}(m, c) = r_T(m, c),$$
$$v^*_{t \dots T}(m, c) = r_t(m, c) + \sup_{a \in A} v^*_{t+1 \dots T}(\Phi_a(m, c)), \qquad (3)$$

and let $\pi_t^*(m, c)$ be an action that attains the sup in the previous equation (it exists 
because of (A2): A is compact). 
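
The backward recursion for the limit system lends itself to a direct implementation. The sketch below is a hedged illustration, not the algorithm used by the authors: it replaces the compact action set A by a finite grid `actions` (an approximation when A is infinite), enumerates action sequences exhaustively, and uses a hypothetical toy system (`toy_phi`, `toy_reward`) whose first state feeds an absorbing second state.

```python
def best_limit_actions(m, c, t, T, actions, phi, reward):
    """Evaluate the deterministic recursion by exhaustive search on a grid.

    Returns (v, seq): v approximates v*_{t...T}(m, c) and seq is a list of
    T - t grid actions attaining it. Cost grows as len(actions) ** (T - t).
    """
    if t == T:
        return reward(T, m, c), []
    best_value, best_seq = None, None
    for a in actions:
        m_next, c_next = phi(m, c, a)
        value, seq = best_limit_actions(m_next, c_next, t + 1, T,
                                        actions, phi, reward)
        if best_value is None or value > best_value:
            best_value, best_seq = value, [a] + seq
    return reward(t, m, c) + best_value, best_seq

# Hypothetical toy system: state 1 is absorbing, the action a is the
# probability of jumping from state 0 to state 1, reward = mass in state 1.
def toy_phi(m, c, a):
    return [m[0] * (1.0 - a), m[0] * a + m[1]], c

def toy_reward(t, m, c):
    return m[1]

value, seq = best_limit_actions([1.0, 0.0], [0.0], 0, 2,
                                [0.0, 0.5, 1.0], toy_phi, toy_reward)
```

Because the limit system is deterministic, the search only has to follow one trajectory per action sequence, which is what makes the limit problem tractable compared with the N-particle MDP.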
We will show by induction on $t \le T$ that $V^{*N}_{t \dots T}(\cdot, \cdot)$ is continuous 
(note that since $M \in \mathcal{P}_N(\mathcal{S})$ is discrete, the continuity in M is trivial) 
and that we can define an optimal policy $\pi_t^{*N}(M, C)$ such that: 

$$V^{*N}_{t \dots T}(M, C) = r_t(M, C) + E\big[V^{*N}_{t+1 \dots T}(\Phi^N_{\pi_t^{*N}(M, C)}(M, C))\big]. \qquad (4)$$

For t = T, the assumption holds by the continuity of r (A3). 

Let us assume that it holds for $t + 1 \le T$. By assumption (A3), the mapping g and the 
kernel K are continuous in a; thus if $\{a^{(k)}\}_{k \in \mathbb{N}}$ is a sequence of actions 
converging to a, $\Phi^N_{a^{(k)}}$ converges (in law) to $\Phi^N_a$. As $V^{*N}_{t+1 \dots T}$ is 
continuous, $a \mapsto E[V^{*N}_{t+1 \dots T}(\Phi_a^N(M, C))]$ is continuous. Using this 
continuity and the compactness of A, the optimal action $\pi_t^{*N}(M, C) \in A$ exists. The 
functions r, g, K are uniformly continuous in a, therefore the continuity of the function 
$a \mapsto E[V^{*N}_{t+1 \dots T}(\Phi_a^N(M, C))]$ is uniform in (M, C). This shows that 
$(M, C) \mapsto \sup_a E[V^{*N}_{t+1 \dots T}(\Phi_a^N(M, C))]$ is continuous, and the property 
is proved for all t. 

Let us now prove by induction on t that for all sequences $(M^N, C^N)$ converging almost surely 
to (m, c), $V^{*N}_{t \dots T}(M^N, C^N) \to v^*_{t \dots T}(m, c)$ almost surely. This is clearly 
true for t = T. Assume that it holds for some $t + 1 \le T$ and let us call 
$a_t^* \dots a_{T-1}^*$ a sequence of optimal actions for the deterministic limit. Lemma 2 shows 
that $V^N_{a_t^* \dots a_{T-1}^*}(M^N, C^N) \to v_{a_t^* \dots a_{T-1}^*}(m, c) = v^*_{t \dots T}(m, c)$. 
In particular, this shows the last equality (which holds a.s.) of the following relation: 

$$\liminf_{N \to \infty} V^{*N}_{t \dots T}(M^N, C^N) \ge \liminf_{N \to \infty} V^N_{a_t^* \dots a_{T-1}^*}(M^N, C^N) = v^*_{t \dots T}(m, c).$$

Let $a^{*N}$ be a sequence of actions maximizing the expectation in (2). As A is compact, there 
exists a subsequence $a^{*\varphi(N)}$ converging to a value a. Again by Lemma 2, 
$r_t(M^{\varphi(N)}, C^{\varphi(N)}) + E\big[V^{*\varphi(N)}_{t+1 \dots T}(\Phi^{\varphi(N)}_{a^{*\varphi(N)}}(M^{\varphi(N)}, C^{\varphi(N)}))\big]$ 
converges a.s. to $r_t(m, c) + v^*_{t+1 \dots T}(\Phi_a(m, c)) \le v^*_{t \dots T}(m, c)$. Using 
both inequalities, this shows that $V^{*N}_{t \dots T}(M^N, C^N) \to v^*_{t \dots T}(m, c)$. 

To conclude the proof, remark that since the limit system is deterministic and takes the values 
$(m_0, c_0), \dots, (m_t, c_t)$, fixing the policy at time t to the action 
$a_t^* = \pi_t^*(m_t, c_t)$ achieves the optimal reward. □ 

This result has several practical consequences. Recall that the limit actions 
$a_0^* \dots a_{T-1}^*$ are a sequence of optimal actions in the limit case, i.e. such that 
$v_{a_0^* \dots a_{T-1}^*}(m, c) = v_T^*(m, c)$. This result proves that in the limit case, the 
optimal policy does not depend on the state of the system. This also shows that incomplete 
information policies are as good as complete information policies. However, the state 
$(M_t^N, C_t^N)$ is not deterministic and, on one trajectory of the system, it could be quite far 
from its deterministic limit $(m_t, c_t)$. In the proof of Theorem 3 we also defined the policy 
$\pi_t^*(M_t^N, C_t^N)$, which is optimal for the deterministic system starting at time t in 
state $(m_t, c_t)$. The least we can say is that this strategy is also asymptotically optimal, 
that is: 

$$\lim_{N \to \infty} V^N_{\pi_0^* \dots \pi_{T-1}^*}(M^N, C^N) = \lim_{N \to \infty} V^N_{a_0^* \dots a_{T-1}^*}(M^N, C^N).$$

In practical situations, using this policy will decrease the risk of being far from the optimal 
state. On the other hand, using this policy has some drawbacks. The first one is that the 
complexity of computing the optimal policy for all states can be much larger than the complexity 
of computing $a_0^* \dots a_{T-1}^*$. Another one is that the system becomes very sensitive to 
random perturbations: the policy $\pi^*$ is not necessarily continuous and may not have a limit. 
A comparison between the performances of $a_0^* \dots a_{T-1}^*$ and 
$\pi_0^* \dots \pi_{T-1}^*$ is provided over an example later in the paper. 
3.2 Central Limit Theorems 

In this part we prove central limit theorems for interacting particles. These results provide 
estimates on the speed of convergence to the mean field limit. This section contains two main 
results. 

The first one is that when the control action sequence is fixed, the gap to the mean field limit 
decreases as the inverse square root of the number of particles. The second result states that the 
gap between the optimal reward for the finite system and the optimal reward for the limit system 
also decreases as fast as $1/\sqrt{N}$. These properties are formalized in Theorems 5 and 4 
respectively. 

To prove these results, we will need additional assumptions (A4-bis) and (A5) or (A5-bis). 

(A4-bis) Initial Gaussian variable - There exists a Gaussian vector $G_0$ of mean 0 with 
covariance $\Gamma_0$ such that the vector $\sqrt{N}((M_0^N, C_0^N) - (m_0, c_0))$ (with S+d 
components) converges in law to $G_0$. (This is denoted as 
$\sqrt{N}((M_0^N, C_0^N) - (m_0, c_0)) \xrightarrow{\mathcal{L}} G_0$.) This assumption also 
includes (A4), i.e. almost sure convergence of the initial state. 

(A5) Continuous differentiability - For all t and all $i, j \in \mathcal{S}$, all functions g, 
$K_{ij}$ and $r_t$ are continuously differentiable. 

(A5-bis) Differentiability in $a_0 \dots a_{T-1}$ - Let $(m_t, c_t)$ be the deterministic limit of 
the system if the controller takes the actions $a_0 \dots a_{T-1}$; then for all 
$i, j \in \mathcal{S}$, the functions g, $K_{ij}$ and $r_t$ are differentiable at the points 
$(m_t, c_t)$. 

These assumptions are slightly stronger than (A3) and (A4) but remain very natural. (A4-bis) 
is clearly necessary for Theorems 5 and 4 to hold. The differentiability condition implies that if 
the gap between $M_t^N$ and $m_t$ is of order $1/\sqrt{N}$, it remains of the same order at time 
t + 1. For Theorem 5, (A5-bis) is necessary, but for Theorem 4 the differentiability assumption 
can be replaced by a Lipschitz continuity condition. This will be further discussed later. 

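Theorem 4 (Central limit theorem for costs). Under assumptions (A1, A2, A3, A4-bis, A5), 

(i) there exist constants $\beta$ and $\gamma$ such that for all x: 

$$\limsup_{N \to \infty} P\Big(\sqrt{N}\,\big|V_T^{*N}(M_0^N, C_0^N) - v_T^*(m_0, c_0)\big| \ge x\Big) \le P\big(\beta \|G_0\|_\infty + \gamma \ge x\big); \qquad (6)$$

(ii) there exist constants $\beta', \gamma' > 0$ such that for all x: 

$$\limsup_{N \to \infty} P\Big(\sqrt{N}\,\big|V_T^{*N}(M_0^N, C_0^N) - V^N_{a_0^* \dots a_{T-1}^*}(M_0^N, C_0^N)\big| \ge x\Big) \le P\big(\beta' \|G_0\|_\infty + \gamma' \ge x\big); \qquad (7)$$

where $\|G\|_\infty = \sup_i |G^i|$. 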

This theorem is the main result of this section. The previous result (Theorem 3) says that 
$\lim_{N \to \infty} V_T^{*N}(M_0^N, C_0^N) = \lim_{N \to \infty} V^N_{a_0^* \dots a_{T-1}^*}(M_0^N, C_0^N) = v_T^*(m_0, c_0)$. 
This new theorem says that both the gap between the cost under the optimal policy and the optimal 
cost of the limit system (i) and the gap between the cost under the optimal policy and the cost 
when using the limit actions (ii) are random variables that decrease to 0 at rate $1/\sqrt{N}$, 
with bounds given by Gaussian laws. Actually, a stronger result (using almost sure convergence 
instead of convergence in law) will be shown in Corollary 8. A direct consequence of this result 
is that there exists a constant $\gamma''$ such that: 

$$E\Big[\sqrt{N}\,\big|V_T^{*N}(M_0^N, C_0^N) - v_T^*(m_0, c_0)\big|\Big] \le \gamma''. \qquad (8)$$

The rest of this section is devoted to the proof of this theorem. A first step in the proof of 
Theorem 4 is a central limit theorem for the states, which is of independent interest. 

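Theorem 5 (Mean field central limit theorem). Under assumptions (A1, A2, A3, A4-bis, A5-bis), if 
the actions taken by the controller are $a_0 \dots a_{T-1}$, there exist Gaussian vectors of mean 
0, $G_1 \dots G_{T-1}$, such that for every t: 

$$\sqrt{N}\big((M_0^N, C_0^N) - (m_0, c_0), \dots, (M_t^N, C_t^N) - (m_t, c_t)\big) \xrightarrow{\mathcal{L}} (G_0, \dots, G_t). \qquad (9)$$

Moreover, if $\Gamma_t$ is the covariance matrix of $G_t$, then: 

$$\Gamma_{t+1} = \begin{bmatrix} P_t & F_t \\ Q_t & H_t \end{bmatrix}^{T} \Gamma_t \begin{bmatrix} P_t & F_t \\ Q_t & H_t \end{bmatrix} + \begin{bmatrix} D_t & 0 \\ 0 & 0 \end{bmatrix} \qquad (10)$$

where for all $1 \le i, j \le S$ and $1 \le k, \ell \le d$: 
$(P_t)_{ij} = K_{ij}(a_t, c_t)$, 
$(Q_t)_{kj} = \sum_{i=1}^{S} m_t^i \frac{\partial K_{ij}}{\partial c_k}(a_t, c_t)$, 
$(F_t)_{ik} = \frac{\partial g_k}{\partial m_i}(m_{t+1}, c_t)$, 
$(H_t)_{k\ell} = \frac{\partial g_\ell}{\partial c_k}(m_t, c_t)$, 
$(D_t)_{jj} = \sum_{i=1}^{S} m_t^i (P_t)_{ij}(1 - (P_t)_{ij})$ and 
$(D_t)_{jk} = -\sum_{i=1}^{S} m_t^i (P_t)_{ij}(P_t)_{ik}$ for $j \ne k$. 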

Proof. Let us assume that Equation (9) holds for some $t \ge 0$. 

As $\sqrt{N}((M^N, C^N)_t - (m, c)_t)$ converges in law to $G_t$, there exists another probability 
space and random variables $\tilde{M}^N$ and $\tilde{C}^N$ with the same distribution as $M^N$ and 
$C^N$ such that $\sqrt{N}((\tilde{M}^N, \tilde{C}^N)_t - (m, c)_t)$ converges almost surely to 
$G_t$. In the rest of the proof, by abuse of notation, we will write M and C instead of 
$\tilde{M}$ and $\tilde{C}$, and then we assume that 
$\sqrt{N}((M^N, C^N)_t - (m, c)_t) \xrightarrow{a.s.} G_t$. 

$G_t$ being a Gaussian vector, there exists a vector of S+d independent Gaussian variables 
$U = (u_1, \dots, u_{S+d})^T$ and a matrix X of size $(S+d) \times (S+d)$ such that $G_t = XU$. 

Let us call $P_t^N \overset{\text{def}}{=} K(a_t, C_t^N)$. According to Lemma 6, there exists a 
Gaussian variable $H_t$ independent of $G_t$ and of covariance D such that we can replace 
$M_{t+1}^N$ (without changing $M_t^N$ and $C_t^N$) by a random variable $\tilde{M}_{t+1}^N$ with 
the same law such that: 

$$\sqrt{N}\big(\tilde{M}_{t+1}^N - M_t^N P_t^N\big) \xrightarrow{a.s.} H_t.$$

In the following, by abuse of notation we write M instead of $\tilde{M}$. Therefore we have 

$$\sqrt{N}(M_{t+1}^N - m_t P_t) = \sqrt{N}\big(M_{t+1}^N - M_t^N P_t^N + m_t (P_t^N - P_t) + (M_t^N - m_t) P_t + (M_t^N - m_t)(P_t^N - P_t)\big) \qquad (11)$$
$$\xrightarrow{a.s.} H_t + m_t \lim_{N \to \infty} \sqrt{N}(P_t^N - P_t) + \lim_{N \to \infty} \sqrt{N}(M_t^N - m_t) P_t.$$
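By assumption, $\lim_{N \to \infty} \sqrt{N}(M_t^N - m_t)_i = (XU)_i$. Moreover, the first order 
Taylor expansion with respect to all components of C gives a.s. 

$$\lim_{N \to \infty} \sum_{i=1}^{S} m_t^i \sqrt{N}(P_t^N - P_t)_{ij} = \sum_{i=1}^{S} m_t^i \sum_{k=1}^{d} \frac{\partial K_{ij}}{\partial c_k}(a_t, c_t)\,(XU)_{S+k} = \sum_{k=1}^{d} Q_{kj}\,(XU)_{S+k}.$$

Thus, the jth component of $\sqrt{N}(M_{t+1}^N - m_t P_t)$ tends to 

$$H_t^j + \sum_{k=1}^{d} Q_{kj}\,(XU)_{S+k} + \sum_{i=1}^{S} (XU)_i P_{ij}. \qquad (12)$$

Using similar ideas, we can prove that $\sqrt{N}(C_{t+1}^N - c_{t+1})$ converges almost surely to 
a linear combination of the Gaussian variables above, with coefficients given by the partial 
derivatives of g (the matrices $F_t$ and $H_t$). Thus 
$\sqrt{N}((M_{t+1}^N, C_{t+1}^N) - (m_{t+1}, c_{t+1}))$ converges almost surely to a Gaussian 
vector. 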

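Let us write the covariance matrix at time t and time t + 1 as two block matrices: 

$$\Gamma_t = \begin{bmatrix} M & O \\ O^T & C \end{bmatrix} \quad \text{and} \quad \Gamma_{t+1} = \begin{bmatrix} M' & O' \\ O'^T & C' \end{bmatrix}.$$

For $1 \le j, j' \le S$, $M'_{jj'}$ is the expectation of (12) taken in j times (12) taken in j'. 
Using the facts that $E[(XU)_{S+k}(XU)_{S+k'}] = C_{kk'}$, $E[(XU)_{S+k}(XU)_i] = O_{ik}$ and 
$E[(XU)_i (XU)_{i'}] = M_{ii'}$, this leads to: 

$$M'_{jj'} = E[H_t^j H_t^{j'}] + \sum_{k,k'} Q_{kj} Q_{k'j'} C_{kk'} + \sum_{k,i'} Q_{kj} P_{i'j'} O_{i'k} + \sum_{i,k'} P_{ij} Q_{k'j'} O_{ik'} + \sum_{i,i'} P_{ij} P_{i'j'} M_{ii'}.$$

By similar computation, we can write similar equations for O' and C' that lead to Equation 
(10). □ 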
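
To make the covariance recursion concrete, here is a small numerical sketch. It assumes that Equation (10) has the form $\Gamma_{t+1} = A_t^T \Gamma_t A_t + \mathrm{blockdiag}(D_t, 0)$ with $A_t$ the block matrix built from $P_t$, $F_t$, $Q_t$ and $H_t$; this reading, the helper names and the numerical values below are assumptions made for illustration only.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def next_covariance(gamma, P, F, Q, H, D):
    """One step of the assumed recursion: A^T * gamma * A + blockdiag(D, 0)."""
    S, d = len(P), len(Q)
    # Block matrix A = [[P, F], [Q, H]] of size (S + d) x (S + d).
    A = [P[i] + F[i] for i in range(S)] + [Q[k] + H[k] for k in range(d)]
    out = matmul(matmul(transpose(A), gamma), A)
    for j in range(S):
        for k in range(S):
            out[j][k] += D[j][k]  # noise created by the particle transitions
    return out

# Hypothetical values with S = 2 states and d = 1 context component.
P = [[0.5, 0.5], [0.25, 0.75]]
F = [[0.0], [1.0]]
Q = [[0.0, 0.0]]
H = [[0.5]]
D = [[0.21875, -0.21875], [-0.21875, 0.21875]]

gamma0 = [[0.0] * 3 for _ in range(3)]   # deterministic initial state
gamma1 = next_covariance(gamma0, P, F, Q, H, D)
gamma2 = next_covariance(gamma1, P, F, Q, H, D)
```

Starting from a deterministic initial state, the first step only injects the transition noise $D_t$; later steps propagate it to the context coordinates through $F_t$.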

Lemma 6. Let $M^N$ be a sequence of random measures on $\{1, \dots, S\}$ and $P^N$ a sequence of 
random stochastic matrices on $\{1, \dots, S\}$ such that $(M^N, P^N) \xrightarrow{a.s.} (m, p)$. 
Let $(U_{ik})_{1 \le i \le S, k \ge 1}$ be a collection of iid random variables following the 
uniform distribution on [0; 1] and independent of $P^N$ and $M^N$, and let us define $Y^N$: for 
all $1 \le j \le S$: 

$$Y_j^N \overset{\text{def}}{=} \frac{1}{N} \sum_{i=1}^{S} \sum_{k=1}^{N M_i^N} \mathbf{1}_{\sum_{l<j} P^N_{il} < U_{ik} \le \sum_{l \le j} P^N_{il}},$$

then there exists a Gaussian vector G independent of $M^N$ and $P^N$ and a random variable $Z^N$ 
with the same law as $Y^N$ such that 

$$\sqrt{N}(Z^N - M^N P^N) \xrightarrow{a.s.} G.$$

Moreover the covariance of the vector G is the matrix D: 

$$D_{jj} = \sum_{i=1}^{S} m_i \, p_{ij}(1 - p_{ij}). \qquad (13)$$

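Proof. As $(M^N, P^N)$ and $(U_{ik})_{1 \le i \le S, k \ge 1}$ are independent, they can be viewed 
as functions on independent probability spaces $\Omega$ and $\Omega'$. For all 
$(\omega, \omega') \in \Omega \times \Omega'$, let 
$X_\omega^N(\omega') \overset{\text{def}}{=} \sqrt{N}\big(Y^N(\omega, \omega') - M^N(\omega) P^N(\omega)\big)$. 

By assumption, for almost all $\omega \in \Omega$, $(M^N(\omega), P^N(\omega))$ converges to 
(m, p). A direct computation shows that, when N grows, the characteristic function of 
$X_\omega^N$ at a point $\xi$ converges to $\exp(-\tfrac{1}{2} \xi^T D \xi)$. Therefore for almost 
all $\omega$, $X_\omega^N$ converges in law to G, a Gaussian random variable on $\Omega'$. 

Therefore for almost all $\omega$, there exists a random variable $\tilde{X}_\omega^N$ with the 
same law as $X_\omega^N$ that converges $\omega'$-almost surely to $G(\omega')$. Let 
$Z^N(\omega, \omega') \overset{\text{def}}{=} M^N(\omega) P^N(\omega) + \frac{1}{\sqrt{N}} \tilde{X}_\omega^N(\omega')$. 
By construction of $\tilde{X}_\omega^N$, for almost all $\omega$, $Z^N(\omega, \cdot)$ has the 
same distribution as $Y^N(\omega, \cdot)$ and $\sqrt{N}(Z^N - M^N P^N)$ converges 
$(\omega, \omega')$-almost surely to G. Thus there exists a function $Z^N(\omega, \cdot)$ that has 
the same distribution as $Y^N(\omega, \cdot)$ for all $\omega$ and that converges 
$(\omega, \omega')$-almost surely to G. □ 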
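
The variance formula of the lemma can be checked empirically. The following sketch is a hedged illustration, not part of the original proof: it fixes a hypothetical kernel `P` and counts, draws the particle moves with `random.Random.choices` (equivalent in law to thresholding the uniform variables $U_{ik}$), and compares the second moment of $\sqrt{N}(Y_j - (MP)_j)$ with $\sum_i m_i p_{ij}(1 - p_{ij})$.

```python
import math
import random

def scaled_gap(counts, P, j, rng):
    """Return sqrt(N) * (Y_j - (M P)_j) for one draw of the particle moves."""
    n = sum(counts)
    S = len(counts)
    arrivals = 0
    for i, k in enumerate(counts):
        # Each of the k particles in state i jumps independently with row P[i].
        arrivals += rng.choices(range(S), weights=P[i], k=k).count(j)
    expected = sum(counts[i] / n * P[i][j] for i in range(S))
    return math.sqrt(n) * (arrivals / n - expected)

def limit_variance(m, P, j):
    """Diagonal entry D_jj = sum_i m_i P_ij (1 - P_ij)."""
    return sum(m[i] * P[i][j] * (1.0 - P[i][j]) for i in range(len(m)))

rng = random.Random(1)
P = [[0.5, 0.5], [0.25, 0.75]]   # hypothetical kernel
counts = [100, 100]              # N = 200, M = (0.5, 0.5)
samples = [scaled_gap(counts, P, 0, rng) for _ in range(2000)]
empirical = sum(x * x for x in samples) / len(samples)
theoretical = limit_variance([0.5, 0.5], P, 0)
```

With M held fixed, the variance of the scaled gap equals $D_{jj}$ for every N, so `empirical` should be close to `theoretical` up to Monte Carlo error; the Gaussian shape only appears as N grows.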

The first application of the mean field CLT is to show that it also works for the cost. Let 
us assume that the controller takes actions $a_0 \dots a_{T-1}$ and let us introduce the 
definitions 
$R^N_{a_0 \dots a_{T-1}}(M_0^N, C_0^N) \overset{\text{def}}{=} \sum_{t=1}^{T} r_t(M_t^N, C_t^N)$ 
and $r_{a_0 \dots a_{T-1}}(m_0, c_0) \overset{\text{def}}{=} \sum_{t=1}^{T} r_t(m_t, c_t)$. 
Lemma 2 says that $R^N_{a_0 \dots a_{T-1}}(M_0^N, C_0^N) \to r_{a_0 \dots a_{T-1}}(m_0, c_0)$; the 
following result is more accurate: 

Corollary 7 (Application of the CLT to reward). Under assumptions (A1, A2, A3, A4-bis, A5-bis), if 
the controller takes the actions $a_0 \dots a_{T-1}$ and if we call $Dr_t(m_t, c_t)$ the 
differential of $r_t(M, C)$ at the point $(m_t, c_t)$, we have: 

$$\sqrt{N}\big(R^N_{a_0 \dots a_{T-1}}(M_0^N, C_0^N) - r_{a_0 \dots a_{T-1}}(m_0, c_0)\big) \xrightarrow{\mathcal{L}} \sum_{t=1}^{T} Dr_t(m_t, c_t)\, G_t. \qquad (14)$$

Proof. Let $G_0 \dots G_T$ be the Gaussian variables defined in the central limit theorem. The 
proof of Theorem 5 says that one can replace $(M_t^N, C_t^N)$ by variables with the same law such 
that the convergence is almost sure. Let $\omega$ be an event such that 
$\lim_{N \to \infty} \sqrt{N}\big((M_t^N(\omega), C_t^N(\omega)) - (m, c)_t\big) = G_t(\omega)$. 
For this event, we have 
$\lim_{N \to \infty} \sqrt{N}\big(r_t(M_t^N, C_t^N) - r_t(m_t, c_t)\big) = Dr_t(m_t, c_t)\, G_t$, 
which leads to Equation (14) by using a Taylor expansion at order one. □ 

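As the means of the Gaussian variables are 0, we have directly: 

Corollary 8. Under the same assumptions and if the convergence of the initial condition is almost 
sure ($\sqrt{N}((M_0^N, C_0^N) - (m_0, c_0)) \xrightarrow{a.s.} G_0$), one has: 

$$\limsup_{N \to \infty} \sqrt{N}\,\big|V^N_{a_0 \dots a_{T-1}}(M_0^N, C_0^N) - v_{a_0 \dots a_{T-1}}(m_0, c_0)\big| \le |Dr_0(m_0, c_0)\, G_0| \quad \text{a.s.} \qquad (15)$$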

Proof. $V^N_{a_0 \dots a_{T-1}}(M_0^N, C_0^N) - v_{a_0 \dots a_{T-1}}(m_0, c_0) = r_0(M_0^N, C_0^N) - r_0(m_0, c_0) + E_{M_0^N, C_0^N}\big[R^N_{1 \dots T}(M_1^N, C_1^N) - r_{1 \dots T}(m_1, c_1)\big]$. 
As $\sqrt{N}((M_0^N, C_0^N) - (m_0, c_0))$ converges almost surely, the first part of the sum, 
once scaled by $\sqrt{N}$, can be upper bounded by $|Dr_0(m_0, c_0)\, G_0|$. As for the second 
part of the sum, using the Berry-Esseen Theorem (Durrett 2.4.d [9]), one can refine Lemma 6 and 
show that the convergence is uniform. Therefore one can switch the expectation and the limit, and 
the second part of the sum becomes 
$E\big[\lim_{N \to \infty} \sqrt{N}\big(R^N_{1 \dots T}(M_1^N, C_1^N) - r_{1 \dots T}(m_1, c_1)\big)\big] = 0$ 
a.s., which proves Equation (15). □ 



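We are now ready for the proof of Theorem 4. 

Proof of Theorem 4. For a vector G, let us write $\|G\|_1 = \sum_i |G^i|$. Because of assumption 
(A4), there exists a compact set B such that for all t from 0 to T, $(M_t^N, C_t^N)$ will remain 
in B. 

Let us prove by induction on t from T to 0 that there exist $\beta_t, \gamma_t \in \mathbb{R}$ 
such that if there exists a Gaussian variable $G_t$ satisfying 
$\sqrt{N}((M_t^N, C_t^N) - (m_t, c_t)) \xrightarrow{a.s.} G_t$, then 

$$\limsup_{N \to \infty} \sqrt{N}\,\big|V^{*N}_{t \dots T}(M_t^N, C_t^N) - v^*_{t \dots T}(m_t, c_t)\big| \le \beta_t \|G_t\|_\infty + \gamma_t. \qquad (16)$$

For t = T, Corollary 8 can be used to transform Equation (16) into 
$|Dr_T(m_T, c_T)\, G_T| \le \|Dr_T(m_T, c_T)\|_1 \|G_T\|_\infty$. Therefore, Inequality (16) is 
true if $\beta_T = \|Dr_T(m_T, c_T)\|_1$ and $\gamma_T = 0$. 

Let us assume that (16) holds for some $t + 1 \le T$ and that 
$\sqrt{N}((M_t^N, C_t^N) - (m_t, c_t)) \xrightarrow{a.s.} G_t$. At time t, (16) can be upper 
bounded by: 

$$\sqrt{N}\,\big|r_t(M_t^N, C_t^N) - r_t(m_t, c_t)\big| + \sqrt{N}\,\Big|\sup_a E\big[V^{*N}_{t+1 \dots T}(\Phi_a^N(M_t^N, C_t^N))\big] - \sup_a v^*_{t+1 \dots T}(\Phi_a(m_t, c_t))\Big|.$$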



The first part can be bounded by $\|Dr_t(m_t, c_t)\|_1 \|G_t\|_\infty$. The rest of the proof 
focuses on the second part of the sum. In the proof of Theorem 5 we showed that for all a (up to 
the replacement of $(M_{t+1}^N, C_{t+1}^N)$ by a random variable with the same law), there exists 
a matrix $P_a$ and a Gaussian variable $G_a$ independent of $G_t$ such that 
$\sqrt{N}\big(((M_t^N, C_t^N), (M_{t+1}^N, C_{t+1}^N)) - ((m_t, c_t), (m_{t+1}, c_{t+1}))\big)$ 
converges almost surely to $(G_t, P_a G_t + G_a)$. Using the fact that 
$\sup_a f(a) - \sup_a g(a) \le \sup_a (f(a) - g(a))$, the expectation can be upper bounded by: 

$$\sup_a \sqrt{N}\, E\,\big|V^{*N}_{t+1 \dots T}(\Phi_a^N(M_t^N, C_t^N)) - v^*_{t+1 \dots T}(\Phi_a(m_t, c_t))\big|.$$

Let us consider an arbitrary action a. The Berry-Esseen Theorem shows that 
$\sqrt{N}((M_{t+1}^N, C_{t+1}^N) - (m_{t+1}, c_{t+1})) - P_a G_t$ converges uniformly to $G_a$; 
therefore we can switch the limit in N and the expectation, and by induction it can be upper 
bounded by 
$E\big[\beta_{t+1} \|P_a G_t + G_a\|_\infty + \gamma_{t+1}\big] \le \beta_{t+1} \|P_a G_t\|_\infty + \gamma_{t+1} + \beta_{t+1} E[\|G_a\|_\infty]$. 
As A is compact and $(M_t^N, C_t^N)$ remains in a compact set B (Equation (1)), 
$\sup_{a \in A, (M, C) \in B} \|P_a\|_1 < \infty$ and 
$\sup_{a \in A, (M, C) \in B} E[\|G_a\|_\infty] < \infty$. Thus, to obtain a uniform bound on all 
(M, C), taking $\beta_t \overset{\text{def}}{=} \beta_{t+1} \sup_{a \in A, (M, C) \in B} \|P_a\|_1$ 
and $\gamma_t \overset{\text{def}}{=} \gamma_{t+1} + \beta_{t+1} \sup_{a \in A, (M, C) \in B} E[\|G_a\|_\infty]$ 
satisfies (16). 

Assumption (A4bis) says that at time $t = 0$, $\sqrt{N}\,((M_0^N, C_0^N) - (m_0, c_0)) \to G_0$ holds in distribution. Using appropriate random variables $(\tilde{M}_0^N, \tilde{C}_0^N)$ with the same laws as $(M_0^N, C_0^N)$ makes this convergence almost sure, so that the induction above holds from $t = 0$. This ends the proof for assertion i of the theorem.

As for assertion ii, it comes from the triangular inequality

$$\bigl|V^{*N}_{0\ldots T}(M_0^N, C_0^N) - V^N_{a^*_0\ldots a^*_{T-1}}(M_0^N, C_0^N)\bigr| \le \bigl|V^{*N}_{0\ldots T}(M_0^N, C_0^N) - v^*_{0\ldots T}(m_0, c_0)\bigr| + \bigl|v^*_{0\ldots T}(m_0, c_0) - V^N_{a^*_0\ldots a^*_{T-1}}(M_0^N, C_0^N)\bigr|.$$

An upper bound on the first term of the right side comes from assertion i and the second term can be bounded using Corollary 8. This ends the proof. □



3.3 Infinite horizon discounted reward 

In this section, we prove the first order results for infinite-horizon discounted Markov decision processes. As in the finite case, we will show that when N grows large, the maximal expected discounted reward converges to the one of the deterministic system and the optimal policy of the deterministic system is also asymptotically optimal. To do this, we need the following new assumptions:

(A6) Homogeneity in time - The reward $r_t$ and the probability kernel $K_t$ do not depend on time: there exist $r$, $K$ such that, for all $M$, $C$, $a$, $r_t(M, C) = r(M, C)$ and $K_t(a, C) = K(a, C)$.

(A7) Bounded reward - $\sup_{M,C} |r(M, C)| \le K < \infty$.

The homogeneity in time is clearly necessary as we are interested in infinite-time behavior. Assuming that the cost is bounded might seem strong, but it is in fact very classical and holds in many situations, for example when $C$ is bounded. The future rewards are discounted according to a discount factor $0 < \delta < 1$: if the policy is $\Pi$, the expected total discounted reward of $\Pi$ is ($\delta$ is omitted in the notation):



$$V^N_\Pi(M_0, C_0) = \mathbb{E}_\Pi\Bigl[\sum_{t=1}^{\infty} \delta^{t-1}\, r(M_t, C_t)\Bigr].$$

Notice that Assumption (A7) implies that this sum remains finite. The optimal total discounted reward $V^{*N}$ is the supremum over all policies. For $T \in \mathbb{N}$, the optimal discounted finite-time reward until $T$ is

$$V_T^{*N}(M_0, C_0) = \sup_\Pi \mathbb{E}_\Pi\Bigl[\sum_{t=1}^{T} \delta^{t-1}\, r(M_t, C_t)\Bigr].$$

As $r$ is bounded, one can show that it converges uniformly in $(M, C)$ to $V^{*N}$:

$$\lim_{T\to\infty} \sup_{M,C} \bigl| V_T^{*N}(M, C) - V^{*N}(M, C) \bigr| = 0. \tag{17}$$

Equation (17) is the key of the following analysis. Using this fact, we can prove the convergence when N grows large for fixed T and then let T go to infinity. Therefore, with very few changes in the proofs of Section 3.1, we have the following result:

Theorem 9 (Optimal discounted case). Under assumptions (A1, A2, A3, A4, A6, A7), as N grows large, the optimal discounted reward of the stochastic system converges to the optimal discounted reward of the deterministic system:

$$\lim_{N\to\infty} V^{*N}(M^N, C^N) =_{a.s.} v^*(m, c),$$

where $v^*(m, c)$ satisfies the Bellman equation for the deterministic system:

$$v^*(m, c) = r(m, c) + \delta \sup_{a \in A} \bigl\{ v^*(\Phi_a(m, c)) \bigr\}.$$
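To make the deterministic Bellman equation $v^*(m,c) = r(m,c) + \delta \sup_a v^*(\Phi_a(m,c))$ concrete, here is a minimal value-iteration sketch in Python. It is only an illustration: the finite state set, the function names (`value_iteration`, `phi`, `reward`) and the toy system are assumptions made for this example and do not come from the paper.

```python
def value_iteration(states, actions, phi, reward, delta, tol=1e-10, max_iter=10_000):
    """Iterate v <- r + delta * max_a v(phi_a(.)) until the update is below tol."""
    v = {s: 0.0 for s in states}
    for _ in range(max_iter):
        new_v = {s: reward(s) + delta * max(v[phi(a, s)] for a in actions)
                 for s in states}
        gap = max(abs(new_v[s] - v[s]) for s in states)
        v = new_v
        if gap < tol:
            break
    return v


# Hypothetical toy system (not from the paper): positions 0..4 on a line,
# actions move one step left or right, reward 1 only at position 4.
states = range(5)
actions = (-1, +1)
phi = lambda a, s: min(max(s + a, 0), 4)      # deterministic transition Phi_a
reward = lambda s: 1.0 if s == 4 else 0.0
v_star = value_iteration(states, actions, phi, reward, delta=0.5)
```

Since $\delta < 1$, the update is a contraction, so the iteration converges geometrically; in the toy system the fixed point at position 4 solves $v = 1 + 0.5\,v$.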

3.3.1 Problems for other infinite horizon criteria 

Again, the discounted problem is very similar to the finite case because the total reward mostly depends on the rewards during a finite amount of time. As for other infinite-horizon criteria such as average reward or its variants, the average reward is (if it exists) $\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r(M_t, C_t)$.

This raises the problem of the exchange of the limits $N \to \infty$ and $T \to \infty$. Consider a case without control with two states $S = \{0; 1\}$, where $C_t$ is the mean number of particles in state 1 ($C_t = (M_t)_1$), and with a function $f : [0;1] \to [0;1]$ such that the transition kernel $K$ is $K_{i1}(C) = f(C)$ for $i \in S$. If $M_0^N \to m_0$, then for any fixed $t$, $M_t^N$ converges to $f(f(\ldots f(m_0) \ldots))$. Using techniques that can be found in [7], one can prove that as N grows large, $\lim_{t\to\infty} M_t^N$ might converge to almost any subset $L \subset [0;1]$ such that $L = f(L)$. However, in general $\lim_{t\to\infty} \lim_{N\to\infty} M_t^N \ne \lim_{N\to\infty} \lim_{t\to\infty} M_t^N$. For example, if $f(x) = x$, the deterministic system is constant while the stochastic system converges almost surely to a random variable (as a bounded martingale) that takes values in $\{0; 1\}$.

Similar difficulties arise for the central limit theorem in the discounted case: the convergence depends on the behavior of the system when T tends to infinity.
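The example with $f(x) = x$ can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names, the population size (20), the horizon (2000 steps) and the seed are arbitrary choices, not values from the paper. Each particle independently moves to state 1 with probability $f(C_t)$, as in the kernel $K_{i1}(C) = f(C)$ described above.

```python
import random


def mean_field_path(f, m0, steps):
    """Deterministic limit: iterate m <- f(m)."""
    m = m0
    for _ in range(steps):
        m = f(m)
    return m


def particle_path(f, m0, n, steps, rng):
    """Fraction of the n particles in state 1 after `steps` transitions."""
    ones = round(m0 * n)
    for _ in range(steps):
        p = f(ones / n)
        # every particle jumps to state 1 with probability p, independently
        ones = sum(1 for _ in range(n) if rng.random() < p)
    return ones / n


rng = random.Random(0)                      # arbitrary seed
identity = lambda x: x
limit = mean_field_path(identity, 0.5, 2000)
finals = [particle_path(identity, 0.5, 20, 2000, rng) for _ in range(10)]
```

With these settings `limit` stays at the initial value 0.5, whereas each finite system ends up absorbed in 0 or 1, which is the non-commutation of the two limits discussed above.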



4 Application to a brokering problem 

To illustrate the usefulness of our framework, let us consider the following model of a brokering problem in computational grids. There are $A$ application sources that send tasks into a grid system, and a central broker routes all these tasks into $d$ clusters (seen as multi-queues) and tries to minimize the total waiting time of the tasks. A similar queuing model of a grid broker has been used in earlier work.

Here, time is discrete and the $A$ sources follow a discrete on/off model: for each source $j \in \{1 \ldots A\}$, let $Y_t^j = 1$ if the source is on (i.e. it sends a task between $t$ and $t+1$) and 0 if it is off. The total number of packets sent between $t$ and $t+1$ is $Y_t := \sum_j Y_t^j$. Each queue $i \in \{1 \ldots d\}$ is composed of $P_i$ processors, and all of them work at speed $\mu_i$ when available. Each processor $j \in \{1 \ldots P_i\}$ of the queue $i$ can be either available (in that case we set $X_t^{ij} := 1$) or broken (in that case $X_t^{ij} := 0$). The total number of processors available in the queue $i$ between $t$ and $t+1$ is $X_t^i := \sum_j X_t^{ij}$, and we define $B_t^i$ to be the total number of tasks waiting in the queue $i$ at time $t$. At each time slot $t$, the broker (or controller) allocates the $Y_t$ tasks to the $d$ queues: it chooses an action $a_t \in \mathcal{P}(\{1 \ldots d\})$ and routes each of the $Y_t$ packets to queue $i$ with probability $a_t^i$. The system is represented in Figure 1. The number of tasks in the queue $i$ (buffer size) evolves according to the following relation:

$$B^i_{t+1} = \bigl(B^i_t - \mu_i X^i_t + a^i_t Y_t\bigr)^+. \tag{18}$$
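The buffer recursion $B^i_{t+1} = (B^i_t - \mu_i X^i_t + a^i_t Y_t)^+$ can be written as a one-line update. The Python sketch below is an illustration only: the function name `next_buffers` and the numerical values are hypothetical, and the update follows the relation literally (the routed volume is $a^i_t Y_t$ rather than a random split of the packets).

```python
def next_buffers(buffers, speeds, available, routing, arrivals):
    """One step of B_i <- max(B_i - mu_i * X_i + a_i * Y, 0) for every queue i."""
    return [max(b - mu * x + a * arrivals, 0.0)
            for b, mu, x, a in zip(buffers, speeds, available, routing)]


# Hypothetical values: two queues, 4 arriving packets split 25% / 75%.
b_next = next_buffers(buffers=[2.0, 0.0], speeds=[0.5, 2.0],
                      available=[2, 3], routing=[0.25, 0.75], arrivals=4)
```

In this example the second queue has more service capacity than work, so the positive part truncates its buffer at zero.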




Figure 1: The routing system (queues 1 to $d$, with $P_1, \ldots, P_d$ processors).

The cost that we want to minimize is the sum of the waiting times of the tasks. Between $t$ and $t + 1$, there are $\sum_i B_t^i$ tasks waiting in the queues, therefore the cost at time $t$ is $r_t(B) := \sum_i B_t^i$. As we consider a finite horizon, we should decide a cost for the remaining tasks in the queue. In our simulations, we choose $r_T(B) := \sum_i B_T^i$.

This problem can be viewed as a multidimensional restless bandit problem where computing the 
optimal policy for the broker is known to be a hard problem [T^ . Here, indexability may help to 
compute near optimal policies by solving one MDP for each queue [l3, • However the complexity 
remains high when the number of processors in all the queues and the number of sources are large. 



4.1 Mean field limit 

This system can be modeled using the framework of particles evolving in a common environment.

• There are $N = A + \sum_{i=1}^{d} P_i$ "particles". Each particle can either be a source (of type $s$) or a server (belonging to one of the queues, $q_1 \cdots q_d$), and can either be "on" or "off". Therefore, the possible states of one particle are the elements of $S = \{(x, e) \mid x \in \{s, q_1, \ldots, q_d\},\ e \in \{\text{on}, \text{off}\}\}$. The population mix $M$ is the proportion of sources in state on and the proportion of servers in state on, for each queue.

• The actions of the controller are the routing choices of the broker: $a_t^d$ is the probability that a task is sent to queue $d$ at time $t$.

• The environment of the system depends on the vector $B_t = (B_t^1 \cdots B_t^d)$, giving the number of tasks in queues $q_1, \ldots, q_d$ at time $t$. The time evolution of the $i$-th component is

$$B^i_{t+1} = g_i(B_t, M^N_{t+1}, a_t) := \bigl(B^i_t - \mu_i X^i_t + a^i_t Y_t\bigr)^+.$$

The shared environment is represented by the context $C_t^N := (B_t^1/N, \ldots, B_t^d/N)$.

• Here, the transition kernel can be time dependent but is independent of $a$ and $C$. The probability of a particle to go from a state $(x, e) \in S$ to $(y, f) \in S$ is 0 if $x \ne y$ (a source cannot become a server and vice-versa). If $x = y$, then $K_{(x,\text{on}),(x,\text{off})}(a, C)(t)$ as well as $K_{(x,\text{off}),(x,\text{on})}(a, C)(t)$ are arbitrary probabilities.

Here is how a system of size N is defined. A preliminary number of sources $A_0$ as well as a preliminary number $P_i$ of servers per queue is given, totaling $N_0$ particles. For any N, a system with N particles is composed of $\lfloor A_0 N / N_0 \rfloor$ (resp. $\lfloor P_i N / N_0 \rfloor$) particles that are sources (resp. servers in queue $i$). The remaining particles (to reach a total of N) are allocated randomly with a probability proportional to the fractional part of $A_0 N / N_0$ and $P_i N / N_0$, so that the mean number of particles that are sources is $A_0 N / N_0$ and the mean number of particles that are servers in queue $i$ is $P_i N / N_0$. Then, each of these particles changes state over time according to the probabilities $K_{u,v}(a, C)(t)$. At time $t = 0$, a particle is in state "on" with probability one half.
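The construction of a system of size N can be sketched as follows in Python. This is one possible reading of the description above, not the authors' code: the function name `split_population` is hypothetical, and the assumption that each leftover particle is drawn independently (so that one type may receive more than one extra particle) is ours. The weights are the integer remainders of $c\,N$ modulo $N_0$, which are proportional to the fractional parts and give the stated mean counts.

```python
import random


def split_population(n, base_counts, rng):
    """base_counts = [A0, P1, ..., Pd]; returns the number of particles of each type."""
    n0 = sum(base_counts)
    counts = [c * n // n0 for c in base_counts]        # guaranteed floor part
    remainders = [c * n % n0 for c in base_counts]     # proportional to fractional parts
    leftover = n - sum(counts)
    if leftover > 0:
        types = range(len(counts))
        for k in rng.choices(types, weights=remainders, k=leftover):
            counts[k] += 1                             # assumption: independent draws
    return counts


rng = random.Random(1)                                 # arbitrary seed
base = [3, 1, 2, 2, 3, 3]                              # 3 sources and queues of 1, 2, 2, 3, 3 servers
exact = split_population(28, base, rng)                # N multiple of N0: nothing is random
mixed = split_population(25, base, rng)                # 3 particles are placed at random
```

When N is a multiple of $N_0 = 14$ the split is deterministic; otherwise only the few leftover particles are random, which is why their relative weight vanishes as N grows.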

It should be clear that this system satisfies Assumptions (A1) to (A4), and therefore one can apply the convergence theorem (Theorem 3) to this system: it shows that, when using the policies $a^*$ or $\Pi^*$, the system converges as N goes to infinity to a deterministic system with optimal cost. An explicit computation of the policies $a^*$ and $\Pi^*$ is possible here and is postponed to Section 4.3.

4.2 CLT applicability 

As for the central limit theorem, Assumption (A4-bis) on the convergence of the initial condition to a Gaussian variable is true since the random part of the initial state is bounded by a constant and, once divided by $\sqrt{N}$, goes to 0 as N grows. Unfortunately, Assumption (A5) does not hold since the function $g$ is not differentiable when $B_t^i - \mu_i X_t^i + a_t^i Y_t = 0$. However, as mentioned in the beginning of Section 3.2, the differentiability condition in Assumption (A5) can be replaced by a Lipschitz continuity condition. Let us consider Assumption (A5-ter):

(A5-ter) Continuous Lipschitz - For all $t$ and all $i, j \in S$, all functions $g$, $K_{ij}$ and $r_t$ are Lipschitz continuous on all compact sets of their domain.

This assumption is weaker than (A5) since, if a function is $C^1$, it is Lipschitz on every compact set (with Lipschitz constant $\sup \|f'\|$). In the example, function $g$ has a right-derivative and a left-derivative at all points and therefore satisfies (A5-ter). The central limit theorem (Theorem 4) should apply here as well:

Theorem 10. Theorem 4 still holds when replacing (A5) by (A5-ter).

(Sketch of the proof). The proof is very similar to the one of Theorem 4 and we just sketch the main differences.

As seen earlier, all variables are almost surely bounded. By assumption (A5-ter), all functions are Lipschitz; thus let $L_g$, $L_K$, $L_r$ be the Lipschitz constants on the compact space $B$ (see Equation (1)) of $g$, $K$ and $r$ respectively, and $L = \max\{L_g, L_K, L_r\}$. The main idea is to replace all equalities in the proofs of all CLT theorems by inequalities. For instance, the convergence statement of the central limit theorem for the state is replaced by the following statement: for all $x_1 \ldots x_t \in \mathbb{R}^+$,

$$\limsup_{N} P\Bigl(\sqrt{N}\,\bigl(\|(M_0^N, C_0^N) - (m_0, c_0)\|_\infty, \ldots, \|(M_t^N, C_t^N) - (m_t, c_t)\|_\infty\bigr) \ge (x_1 \ldots x_t)\Bigr) \le P\bigl((\|G_0\|_\infty, \ldots, \|G_t\|_\infty) \ge (x_1 \ldots x_t)\bigr) \tag{19}$$

where the variables $G_t$ have covariance $\Gamma_t = L^2 \Gamma_{t-1} + D_{t-1}$. The other steps in the proof can be changed in almost the same way. The formula in Corollary 7 is replaced by

$$\sqrt{N}\,\bigl| V^N_{a_0 \ldots a_{T-1}}(M_0^N, C_0^N) - v_{a_0 \ldots a_{T-1}}(m_0, c_0) \bigr| \le_{st} \sum_{t=0}^{T} L\,\|G_t\|_\infty \tag{20}$$

and Formula (15) of Corollary 8 by

$$\sqrt{N}\,\bigl| V^N_{a_0 \ldots a_{T-1}}(M_0^N, C_0^N) - v_{a_0 \ldots a_{T-1}}(m_0, c_0) \bigr| \le \alpha \|G_0\|_\infty + \delta \quad \text{a.s.} \tag{21}$$

where $\alpha$ and $\delta$ are constants depending on $L$. □

4.3 Optimal policy for the deterministic limit 

As the evolution of the sources and of the processors does not depend on the environment, for all $i$, $t$, the quantities $\mu_i X_t^i$ and $Y_t$ (rescaled by N) converge almost surely to deterministic values that we call $x_t^i$ and $y_t$. If $y_t^i$ is the number of packets distributed to the $i$-th queue at time $t$, $c^i_{t+1} = (c^i_t + y^i_t - x^i_t)^+$. The deterministic optimization problem is to compute

$$\min_{y_1^i \ldots y_T^i} \sum_{t=1}^{T} \sum_{i=1}^{d} c_t^i \quad \text{subject to} \quad \sum_i y_t^i = y_t.$$

Let us call $w_t^i$ the work done by the queue $i$ at time $t$: $w_t^i = c_{t-1}^i - c_t^i + y_{t-1}^i$. The sum of the sizes of the queues at time $t$ does not depend on which queue did the job but only on the quantity of work done:

$$\sum_{i=1}^{d} c_t^i = \sum_{i=1}^{d} c_0^i + \sum_{u<t} y_u - \sum_{i=1}^{d} \sum_{u \le t} w_u^i.$$

Therefore, to minimize the total cost, we have to maximize the total work done by the queues. Using this fact, the optimal strategy can be computed by iteration of a greedy algorithm.
The principle of the algorithm is the following. 

1. The processors in all queues, which are "on" at time $t$ with a speed $\mu$, are seen as slots of size $\mu$.

2. At each time t, yt units of tasks have to be allocated. This is done in a greedy fashion by 
filling up the empty slots starting from time t. Once all slots at time t are full, slots at time 
t + 1 are considered and are filled up with the remaining volume of tasks, and so forth up to 
time T. 

3. The remaining tasks that do not fit in the slots before T are allocated in an arbitrary fashion. 

See Figure 2 for an illustration of the execution of the algorithm on an example. It should be clear that the algorithm is linear in the number of slots and that this algorithm computes an optimal allocation.

Figure 2: This figure presents an example of an execution of the algorithm. We consider a case with 3 queues. At $t = 0$ (resp. 1, ..., 6) there are 8 (resp. 1, 0, 1, 7, 6, 6) packets arriving in the system. Each processor has speed 1 and the processors in state "off" are represented by grey cells (for example, at time 0, there are respectively 3, 0 and 2 processors available in queues 1, 2 and 3). All queues start at time 0 with 2 packets. The top part of the table shows at which time a packet will be processed, while the bottom part shows the corresponding optimal allocation. X represents tasks present in the queues before $t = 0$. A label $T_i$ in a slot of queue $j$ at time $t$ represents one task arriving at time $i$, allocated to queue $j$, that will be processed at time $t$. The number of slots with label $T_i$ should be equal to $y_i$. At the end, 2 packets cannot be allocated in empty slots; they are routed arbitrarily (in queue 1).
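The three steps of the greedy algorithm can be sketched in Python as follows. This is an illustrative reading under assumptions, not the authors' implementation: the names `greedy_allocation` and `total_cost` are hypothetical, volumes are treated as real numbers (fluid tasks), slots at the same time are filled in queue order, and the overflow is sent to the first queue.

```python
def greedy_allocation(capacity, arrivals, backlog):
    """capacity[t][i]: work queue i can do at time t (slot size);
    arrivals[t]: volume y_t; backlog[i]: volume in queue i at time 0.
    Returns alloc[t][i], the volume arriving at t that is routed to queue i."""
    horizon, d = len(arrivals), len(backlog)
    free = [list(row) for row in capacity]
    # Work already queued at time 0 occupies the earliest slots of its own queue.
    for i in range(d):
        rem = backlog[i]
        for t in range(horizon):
            used = min(rem, free[t][i])
            free[t][i] -= used
            rem -= used
    alloc = [[0.0] * d for _ in range(horizon)]
    for t in range(horizon):
        rem = arrivals[t]
        for s in range(t, horizon):            # fill the earliest free slots first
            for i in range(d):
                used = min(rem, free[s][i])
                free[s][i] -= used
                alloc[t][i] += used
                rem -= used
        alloc[t][0] += rem                     # what does not fit: arbitrary queue
    return alloc


def total_cost(capacity, alloc, backlog):
    """Sum over t and i of c_{t+1}^i, with c_{t+1}^i = max(c_t^i + y_t^i - x_t^i, 0)."""
    c, cost = list(backlog), 0.0
    for cap_t, y_t in zip(capacity, alloc):
        c = [max(ci + yi - xi, 0.0) for ci, yi, xi in zip(c, y_t, cap_t)]
        cost += sum(c)
    return cost


# Hypothetical instance: 2 queues, 2 time steps, unit slots, 3 units arriving at t = 0.
capacity = [[1, 1], [1, 1]]
alloc = greedy_allocation(capacity, arrivals=[3, 0], backlog=[0, 0])
greedy_cost = total_cost(capacity, alloc, [0, 0])
naive_cost = total_cost(capacity, [[3, 0], [0, 0]], [0, 0])   # everything to queue 1
```

On this small instance the greedy allocation spreads the work over both queues and leaves one unit waiting for one step, whereas sending everything to a single queue triples the waiting cost.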

4.4 Numerical example 

We consider a simple instance of the resource allocation problem with 5 queues. Initially, they have respectively 1, 2, 2, 3 and 3 processors running at speed .5, .1, .2, .3 and .4 respectively. There are 3 initial sources. The transition matrices are time dependent and are chosen randomly before the execution of the algorithm; that is, they are known for the computation of the optimal policy and are the same for all experiments. We ran some simulations to compute the expected cost of different policies for various sizes of the system. We compare different policies:

1. Deterministic policy $a^*$ - to obtain this curve, the optimal actions $a_0^* \ldots a_{T-1}^*$ that the controller must take for the deterministic system have been computed. At time $t$, action $a_t^*$ is used regardless of the current state, and the cost up to time $T$ is displayed.

2. Limit policy $\Pi^*$ - here, the optimal policy $\Pi^*$ for the deterministic case was first computed. When the stochastic system is in state $(M_t^N, C_t^N)$ at time $t$, we apply the action $\Pi_t^*(M_t^N, C_t^N)$ and the corresponding cost up to time $T$ is reported.

3. Join the Shortest Queue (JSQ) and Weighted Join the Shortest Queue (W-JSQ) - for JSQ, each packet is routed (deterministically) in the shortest queue. In W-JSQ, a packet is routed in the queue whose weighted queue size $B_i/(\mu_i X_i)$ is the smallest.
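The two queue-based heuristics of the list above reduce to an argmin over the queues. The Python sketch below is illustrative only: the function names are hypothetical, ties are broken by the lowest index, and giving an infinite weight to a queue with no available processor is our assumption, since the text does not say how W-JSQ treats $X_i = 0$.

```python
import math


def jsq(buffers):
    """Index of the shortest queue (ties broken by lowest index)."""
    return min(range(len(buffers)), key=lambda i: buffers[i])


def weighted_jsq(buffers, speeds, available):
    """Index minimising B_i / (mu_i * X_i).
    Assumption: a queue with no available processor gets weight +infinity."""
    def weight(i):
        rate = speeds[i] * available[i]
        return buffers[i] / rate if rate > 0 else math.inf
    return min(range(len(buffers)), key=weight)


# Hypothetical snapshot with three queues.
buffers, speeds, available = [4, 3, 6], [0.5, 0.1, 0.4], [1, 2, 3]
choice_jsq = jsq(buffers)
choice_wjsq = weighted_jsq(buffers, speeds, available)
```

In this snapshot JSQ picks the queue holding 3 tasks, while W-JSQ prefers the longest queue because its three fast processors clear it soonest.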
Figure 3: Expected cost of the policies $a^*$, $\Pi^*$, JSQ and W-JSQ for different values of N.



The results are reported in Figures 3 and 4.

A series of simulations with different values of N was run. The reported values in the figures are the mean values of the waiting time over 10000 simulations for small values of N and around 200 simulations for big values of N. Over the whole range for N, the 95% confidence interval is less than 0.1% for the expected cost (Figure 3) and less than 5% for the central limit theorem (Figure 4).

Figure 3 shows the average waiting time of the stochastic system when we apply the different policies. The horizontal line represents the optimal cost of the deterministic system $v^*(m_0, c_0)$, which is probably less than $V^{*N}(M_0, C_0)$. This figure illustrates Theorem 3: if we apply $a^*$ or $\Pi^*$, the cost converges to $v^*(m_0, c_0)$.

In Figure 3, one can see that for low values of N, the curves are not smooth. This behavior comes from the fact that when N is not very large with respect to $N_0$, there are at least $\lfloor \frac{N}{N_0} A_0 \rfloor$ (resp. $\lfloor \frac{N}{N_0} P_i \rfloor$) particles that are sources (resp. processors in queue $i$) and the remaining particles are distributed randomly. The random choice of the remaining states is made so that $\mathbb{E}[A^N] = \frac{N}{N_0} A_0$, where $A^N$ is the number of sources in the system of size N, but the difference $A^N - \frac{N}{N_0} A_0$ may be large. Therefore, for some N the load of the system is much higher than the average load, leading to larger costs. As N grows, the proportion of remaining particles decreases and the phenomenon becomes negligible.

A second feature that shows in Figure 3 is the fact that on all curves, the expected waiting times are decreasing when N grows. This behavior is certainly related to Ross conjecture [15], which says that for a given load, the average queue length decreases when the arrival and service processes are more deterministic.

Finally, the most important information on this figure is the fact that the optimal deterministic policy and the optimal deterministic actions perform better than JSQ and weighted JSQ as soon as the total number of elements in the system is over 200 and 50 respectively. The performance of the deterministic policy $a^*$ is quite far from W-JSQ and JSQ for small values of N, and it rapidly becomes better than JSQ ($N > 30$) and W-JSQ ($N > 200$). Meanwhile the behavior of $\Pi^*$ is uniformly good even for small values of N.

Figure 4: Speed of convergence of the policies $X = a^*$ or $\Pi^*$ for different values of N.

Figure 4 illustrates Theorem 4, which says that the speed of convergence towards the limit is of order $1/\sqrt{N}$. On the y-axis, $\sqrt{N}$ times the average cost of the system minus the optimal deterministic cost is plotted. One can see that the gap between the expected cost of the policy $\Pi^*$ (resp. $a^*$) and the deterministic cost $v^*(m_0, c_0)$ is about (resp. $400/\sqrt{N}$) when N is large. This should be an upper bound on the constant $\delta$ defined in Equation (21).

Besides comparing $a^*$ and $\Pi^*$ to other heuristics, it would be interesting to compare them to the optimal policy of the stochastic system, whose cost is $V^{*N}(M, C)$. One way to compute this optimum would be by using Equation (3). However, to do so, one needs to solve it for all possible values of $M$ and $C$. In this example, $C$ can be as large as the length of the five queues and each particle's state can vary in {on, off}. Therefore, even with $N = 10$ and if we only compute the cost for queues of size less than 10, this leads to $2^{10} \cdot 10^5 \approx 10^8$ states, which is hard to handle even with powerful computers.



5 Computational issues 

Throughout the paper, we have shown that if the controller uses the optimal policy $\Pi^*$ of the deterministic limit of the finite real system, the expected cost will be close to the optimal one (Theorem 3). Moreover, Theorem 4 gives a bound on the error that we make. However, to apply these results in practice, a question remains: how difficult is it to compute the optimal limit policy?

The first answer comes straight from the example. In many cases, even if the stochastic system is extremely hard to solve, the deterministic limit is often much simpler. The best case of course is, as in the example of Section 4, when one can compute the optimal policy. If one cannot compute it, there might also exist approximation policies with bounded error (see 11| for a review on the subject). Imagine that a 2-approximation algorithm exists for the deterministic system; then Theorem 3 proves that for all $\epsilon$, this algorithm will be a $(2+\epsilon)$-approximation for the stochastic system if N is large enough. Finally, heuristics for the deterministic system can also be applied to the stochastic version of the system.

If none of this works properly, one can also compute the optimal deterministic policy by "brute-force" computations using Equation (3): $v^*_{t\ldots T}(m, c) = r_t(m, c) + \sup_{a\in A} v^*_{t+1\ldots T}(\Phi_a(m, c))$. In that case, an approximation of the optimal policy is obtained by discretizing the state space and by solving the equation backward (from $t = T$ to $t = 0$), to obtain the optimal policy for all states. The brute force approach can also be applied directly on the stochastic equation using (2): $V^{*N}_{t\ldots T}(M, C) = r_t(M, C) + \sup_{a\in A} \mathbb{E}_{M,C}\bigl[V^{*N}_{t+1\ldots T}(\Phi^N_a(M, C))\bigr]$. However, solving the deterministic system has three key advantages. The first one is that the size of the discretized deterministic system may have nothing to do with the size of the original state space for N particles: it depends mostly on the smoothness of functions $g$ and $\phi$ rather than on N. The second one is the suppression of the expectation, which might reduce the computational time by a polynomial factor¹ by replacing the $|\mathcal{P}_N(S)|$ possible values of $M^N_{t+1}$ by 1. The last one is that the suppression of this expectation allows one to carry the computation going forward rather than backward. This latter point is particularly useful when the action set and the time horizon are small.
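The brute-force backward resolution of $v^*_{t\ldots T}(m,c) = r_t(m,c) + \sup_a v^*_{t+1\ldots T}(\Phi_a(m,c))$ on a discretized state space can be sketched in Python as follows. Everything in this block is an assumption made for illustration: the function name `backward_induction`, the one-dimensional grid, the nearest-grid-point projection, and the toy dynamics and reward are not taken from the paper.

```python
def backward_induction(grid, actions, phi, reward, horizon):
    """Solve v_t(x) = r_t(x) + max_a v_{t+1}(phi_a(x)) on `grid`, from t = horizon down to 0.
    Off-grid successors are projected to the nearest grid point.
    Returns (v_0, policy) with policy[t][k] the best action at time t in state grid[k]."""
    nearest = lambda x: min(range(len(grid)), key=lambda k: abs(grid[k] - x))
    v = [reward(horizon, x) for x in grid]          # terminal values v_T = r_T
    policy = []
    for t in range(horizon - 1, -1, -1):
        new_v, best = [], []
        for x in grid:
            a_star = max(actions, key=lambda a: v[nearest(phi(a, x))])
            new_v.append(reward(t, x) + v[nearest(phi(a_star, x))])
            best.append(a_star)
        v = new_v
        policy.insert(0, best)
    return v, policy


# Hypothetical toy problem: state in [0, 1], the action shifts it by +/- 0.25,
# and the reward is minus the state (so a cost is minimised).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
actions = (-0.25, 0.25)
phi = lambda a, x: min(max(x + a, 0.0), 1.0)
reward = lambda t, x: -x
v0, policy = backward_induction(grid, actions, phi, reward, horizon=2)
```

The cost of this sketch is proportional to the horizon times the grid size times the number of actions, with no expectation to evaluate, which is the saving over the stochastic equation mentioned above; the accuracy depends on how fine the grid is relative to the smoothness of the dynamics.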



6 Conclusion and future work 

In this paper, we have shown how the mean field framework can be used in an optimization context: the results known for Markov chains can be transposed almost unchanged to Markov decision processes. We further show that the convergence to the mean field limit in both cases (Markovian and Markovian with controlled variables) satisfies a central limit theorem, providing insight on the speed of convergence.

We are currently investigating several extensions of these results. First, if one allows the actions to depend on the particles, it seems natural that the limit behavior of such systems is the same as the limit behavior of systems where the actions are random variables, and that they both converge to a mean field system whose cost is averaged. Another possible direction is to consider stochastic systems where the event rate depends on N. In such cases the deterministic limits are given by differential equations and the speed of convergence can also be studied.

¹ The size of $\mathcal{P}_N(S)$ is the binomial coefficient $\binom{N+1+S}{S}$, which grows polynomially in N as $N \to \infty$.